The goal of the statistical modelling in this study is to find answers to three key questions:
The method chosen for the study was gradient boosted decision trees (GBDT). GBDT is a widely used machine learning technique that can be applied in many settings, from regression and classification to learning-to-rank problems. In a learning-to-rank problem there is an ordered list of items, and the goal of the model is to calculate a score for each item from its features (the explanatory variables) such that the original order is retained.
In the process of building the model, the data set was split into two parts: training data (containing around 70% of the searches) and test data (the remaining 30% or so). The GBDT model was fitted on the training data, predictions were calculated for the test data, and finally the predictions were compared to the observed rankings. The chosen evaluation metric was Spearman’s rank correlation coefficient, a scaled measure of the agreement between two rankings. Perfectly matching rankings give a value of 1, the expected value for random rankings is 0, and exactly reversed rankings give a value of -1.
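The evaluation step can be illustrated with a minimal sketch in plain Python. It is not the code used in the study: the function names `split_by_search`, `average_ranks` and `spearman` are made up for this example, the data is invented, and the sketch assumes that the split is done at the level of whole searches and that tied values receive averaged ranks.

```python
import random
from statistics import mean


def split_by_search(search_ids, train_share=0.7, seed=0):
    """Split at the search level so all results of one search stay together."""
    unique_ids = sorted(set(search_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    cut = round(train_share * len(unique_ids))
    return set(unique_ids[:cut]), set(unique_ids[cut:])


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of the 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the two rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up example: observed positions in one search and hypothetical model scores.
observed = [1, 2, 3, 4, 5]
predicted = [0.2, 0.9, 1.4, 3.1, 7.0]
print(spearman(observed, predicted))        # 1.0: identical ordering
print(spearman(observed, predicted[::-1]))  # -1.0: reversed ordering
```

Whether the coefficient is computed per search and then averaged, or once over all test rows, is a design choice that this sketch leaves open.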
The next step is to understand why the model makes particular predictions: which are the most important features, and how do their values affect the predictions? For this purpose SHapley Additive exPlanations (SHAP) values were calculated. In SHAP, each prediction is expressed as a sum of per-feature contributions (plus a baseline value). The overall impact of any particular feature can then be measured as the average of the absolute values of its contributions over the whole data set.
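The additive decomposition and the mean-absolute-value importance can be sketched on a toy problem. The example below computes exact Shapley values by brute force for a hypothetical two-feature scoring function; it is not the algorithm the SHAP library uses for tree models, and `model`, `value`, `shapley_values` and the data are all invented for illustration.

```python
from itertools import combinations
from math import factorial


def model(row):
    """Hypothetical scoring function standing in for the fitted GBDT."""
    same_city, n_reviews = row
    return 2.0 * same_city + 0.01 * n_reviews + 0.5 * same_city * (n_reviews > 10)


def value(subset, row, background):
    """Average model output when only the features in `subset` come from `row`."""
    total = 0.0
    for ref in background:
        mixed = [row[j] if j in subset else ref[j] for j in range(len(row))]
        total += model(mixed)
    return total / len(background)


def shapley_values(row, background):
    """Exact Shapley values by enumerating every feature subset (tiny inputs only)."""
    n = len(row)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i}, row, background)
                without_i = value(set(subset), row, background)
                phi[i] += weight * (with_i - without_i)
    return phi


# Made-up rows: (has same city as in query: 0 or 1, number of reviews)
data = [(1, 40), (0, 3), (1, 5), (0, 120)]
baseline = sum(model(r) for r in data) / len(data)

all_phi = [shapley_values(row, data) for row in data]
for row, phi in zip(data, all_phi):
    # Additivity: baseline plus the contributions reproduces the prediction.
    assert abs(baseline + sum(phi) - model(row)) < 1e-9

# Global importance: mean absolute contribution per feature over the data set.
importance = [sum(abs(phi[j]) for phi in all_phi) / len(all_phi) for j in range(2)]
print(importance)
```

The brute-force enumeration grows exponentially with the number of features, so it only serves to make the definition concrete.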
The first plot shows feature importance, that is, each feature’s average contribution to the model’s predictions.
The second plot shows the direction of the impact as a function of the feature’s value.
Overall, the most interesting features for us to look at are those that a) rank high in the first plot and b) show a clear pattern in the second plot.
For example, if we look at the first row and the feature named “Has same city listed as in search query”, we can see a polarized distribution of SHAP values around zero. Yellow points correspond to low feature values (in this case, “No”); their impact on the predictions is negative, which makes the model believe that the predicted positions should be worse for them. Purple points, in contrast, correspond to high feature values (“Yes”) and have a positive impact on the predicted positions.
Similarly, “Type category is personal injury” behaves like “Has same city listed as in search query”: a value of “Yes” has a positive impact on the predicted position.
The explanatory variables used in this study can be roughly organized into five main groups. These are listed below, together with a few important variables suggested by the SHAP values.
In terms of SEO, the first two groups are of little interest because they are difficult or even impossible to change, but the last three are more interesting and worth further investigation.
| Type | Value |
|---|---|
| Total unique categories | 72 |
| Missing type category | 1.99% |
| Categories with >= 10 results | 26 |
| Categories with >= 100 results | 13 |
| Categories with >= 1000 results | 3 |
| Median unique categories in one search | 4 |
| Min unique categories in one search | 1 |
| Max unique categories in one search | 12 |
Key takeaways:
| Type | Title | Description |
|---|---|---|
| Median character length (non-missing) | 24 | 534 |
| Min character length (non-missing) | 4 | 8 |
| Max character length (non-missing) | 125 | 752 |
| Missing | 0.01% | 40.7% |
| Containing lawyer or attorney | 22.65% | 43.57% |
| Containing car accident or personal injury | 5.31% | 44.7% |
| Containing city name | 5% | 27.07% |
Key takeaways:
| Type | Value |
|---|---|
| Median #reviews | 14 |
| Max #reviews | 968 |
| No reviews available | 16.59% |
| Average rating | 4.61 |
| Response ratio by owners | 33.43% |
| Average number of likes per review | 0.66 |
Key takeaways:
| Type | ref_domains_dofollow | total_traffic | ahrefs_rank | domain_rating |
|---|---|---|---|---|
| Median | 40 | 82 | 17838462 | 10 |
| Min | 0 | 0 | 4281 | 0 |
| Max | 13379 | 3444072 | 171527697 | 85 |
| Missing | 0.36% | 0.36% | 0.43% | 0.43% |
Key takeaways:
| Type | Value |
|---|---|
| Median #photos | 5 |
| Max #photos | 540 |
| Zero #photos | 5.78% |
| Provides Google updates | 54.79% |